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Most optimal routing problems focus on minimizing travel 
time or distance traveled. Oftentimes, a more useful objec- 
tive is to maximize the probability of on-time arrival, which 
requires statistical distributions of travel times, rather than 
just mean values. We propose a method to estimate travel 
time distributions on large-scale road networks, using probe 
vehicle data collected from GPS. We present a framework 
that works with large input of data, and scales linearly with 
the size of the network. Leveraging the planar topology of 
the graph, the method computes efficiently the time correla- 
tions between neighboring streets. First, raw probe vehicle 
traces are compressed into pairs of travel times and num- 
ber of stops for each traversed road segment using a 'stop- 
and-go' algorithm developed for this work. The compressed 
data is then used as input for training a path travel time 
model, which couples a Markov model along with a Gaussian 
Markov random field. Finally, scalable inference algorithms 
are developed for obtaining path travel time distributions 
from the composite MM-GMRF model. We illustrate the 
accuracy and scalability of our model on a 505,000 road link 
network spanning the San Francisco Bay Area. 

Categories and Subject Descriptors 

H. 4 [Information Systems Applications]: Miscellaneous; 
D.2.8 [Software Engineering]: Metrics — complexity mea- 
sures, performance measures 

General Terms 

estimation, machine learning, inference 

I. INTRODUCTION 

A common problem in trip planning is to make a given 
deadline, for example arriving at the airport within 45 min- 
utes. Most routing services available today minimize the 
expected travel time, but do not provide the most reliable 
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route, which accounts for the variability in travel times. 
Given a time budget, a routing service should provide the 
route with highest probability of on-time arrival, as posed 
in stochastic on-time arrival (SOTA) routing 14 . Such an 
algorithm requires the estimation of the statistical distribu- 
tions of travel times, or at least of their means and variances, 
as done in [ll] . Today, only few traffic information platforms 
are available for the arterial network (the state of the art for 
highway networks is more advanced) and they do not pro- 
vide the statistical distributions of travel times. The main 
contribution of the article is precisely to addresses this gap: 
we present a scalable algorithm for learning path travel time 
distributions on the entire road network using probe vehicle 
data. 

The increasing penetration rate of probe vehicles provides 
a promising source of data to learn and estimate travel time 
distributions in arterial networks. At present, there are two 
general trends in estimation of travel times using this probe 
data. One trend, from kinematic wave theory (see [1S\ [t]), 
derives analytical probability distributions of travel times 
and infer their parameters with probe vehicle data. These 
approaches are computationally intensive, which limits their 
applicability for large scale networks. The other trend, seen 
in large-scale navigation systems such as Google Maps, pro- 
vides coarser information, such as expected travel time, but 
can scale to world-sized traffic networks. 

We bridge the two trends by creating a travel time es- 
timator that (i) provides full probability distributions for 
arbitrary paths in real-time (sub-second), (ii) works on net- 
works the size of large cities (and perhaps larger) (iii) and 
accepts an arbitrary amount of input probe data. The model 
uses a data-driven model which is able to leverage physical 
insight from traffic flow research. Data-driven models, us- 
ing dynamic Bayesian networks [g], nearest neighbors 16 or 
Gaussian models 17 show the potential of such methods to 
make accurate predictions when large amounts of data are 
available. 

The main physical insights modeled in the article are de- 
scribed in the following paragraphs. First, arterial traffic is 
characterized by important travel time variability amongst 
users of the network (Figure [T]) . This variability is mainly 
due to the presence of traffic lights and other impediments 
such as pedestrian crossings and garbage trucks which cause 
a fraction of the vehicles to stop while others do not. Arte- 
rial traffic research [t] suggests that the detection of stops 
explains most of the variability in the travel time distribu- 
tion and underline the multi-modality of the distributions. 
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Figure 1: Histogram of travel times collected on a 
link fit (solid line) using a mixture of Gaussian dis- 
tributions. 

Second, the number of stops on a trajectory exhibits strong 
spatial and temporal correlations. Traffic lights on major 
streets may be synchronized to create "green waves": a vehi- 
cle which does not stop on a link is likely to not stop on the 
subsequent link. A different vehicle arriving 10 seconds later 
may hit the red light on the first link and have a high prob- 
ability to stop on the subsequent link. This phenomenon 
is analyzed in 13 using a Markov model with two modes 
("slow" and "fast"). 

Third, besides the the number of stops, travel times may 
be correlated for the following reasons: (i) the behavior of 
individual drivers may be different: some car may travel no- 
tably faster than some others, (ii) congestion propagates on 
the network, making neighboring links likely to have similar 
congestion levels. If a driver experiences a longer than usual 
travel time on a link because of heavy traffic, he / she can 
will likely experience heavy traffic on the subsequent links. 

We leverage these insight to develop the traffic estimation 
algorithm presented in this article: an end-to-end, scalable 
model for inferring path travel time distributions, referred 
to as a "pipeline" (see Figure |2| . It consists of a learning 
algorithm and an inference algorithm. 

The learning algorithm consists in the following steps. 

• Section |2] a Stop- &- Go filter algorithm detects the num- 
ber of stops on a link and compresses the GPS trace^into 
values of travel times on traversed links and corresponding 
number of stops. 

• Section 3.1 a Markov model (MM) captures the corre- 
lations of stopping / not stopping for consecutive links. It 
characterizes the probability to stop / not stop on a link 
given that the vehicle stopped or did not stop on the previ- 
ous link traversed. The Stop- &- Go filter produces a set of 
labeled data to train the Markov model. 

a Gaussian Markov Random Field (GMRF) 



Section 3.2 



captures the correlations in travel times between neighbor- 
ing links, given the number of stops on the links. There is 
a significant body of prior work in the field of learning with 
graphical models [10l [9], especially for learning problems on 
sparse GMRFs [5l |l2| . Most of these algorithms do not scale 
linearly with respect to the dimension of the data, and are 
unsuitable for very large problems (hundreds of thousands 
of variables). In particular, it becomes practically impossi- 
ble to store the entire covariance matrix, so even classical 
sub-gradient methods such as ^ would require careful en- 
gineering. 

We exploit the (near) planar structure of the underlying 



^Before feeding raw GPS points to the Stop- &- Go filter, each 
coordinate is mapped to a link and an offset distance from 
the beginning of the link. We use an efficient path-matching 
and path- inference algorithm developed in 
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Figure 2: Pipeline of the travel-time estimation 
model. The learning runs offline and finds optimal 
parameters for the Markov model and GMRF mod- 
els. The inference runs online and uses the learned 
parameters to produce travel time distributions on 
input paths. 



graphical model to more efficiently obtain (approximate) al- 
gorithms that scale linearly with the size of the network. 
Our algorithm leverages efficient algorithms to compute the 
Cholesky decomposition of the adjacency matrix of planar 
graphs 2 . Our results can be extended to other physical 
systems with local correlations. 

After the learning, we proceed to the inference algorithm: 
we compute the travel time distributions for arbitrary paths 
in the network (Section [4]). While exact inference on this 
model is intractable due to the number of possible states, 
we exploit the underlying structure of the graphical model 
and use a specialized sampling method to obtain an efficient 
inference algorithm. 

Section [5] illustrates the accuracy and scalability of our 
model on a 505,000 road link network spanning the San 
Francisco Bay Area. 

Our code, as well as a showcase of the model running on 
San Francisco, is available at http: //traffic. berkeleyT] 
|edu/iiavigateSF| 

2. STOP-&-GO MODEL FOR VEHICLES TRA- 
JECTORIES 

The stops due to traffic signals and other factors (double 
parking, garbage trucks and so on) represent one of the main 
source of variability in urban travel times. More generally, 
consider that a link can have m different discrete states. 
For a vehicle traveling on link /, the state si G {0,m — 1} 
of the trajectory is defined as the number of stops on the 
link. The following algorithm estimates the number of stops 
given a set of noisy GPS samples from the trajectory on link 
/. We consider a generic trajectory on a generic link and 
drop indices referring to the trajectory and link for notation 
simplicity. 

The trajectory of the vehicle is represented by an offset 
function T : [0,r] R+, representing the distance from 
the beginning of the link to the location of the vehicle at 
time t. The noisy GPS observations are defined by the 
times = to < • • • < tj < T, and the corresponding offsets 
Xj — T(tj) + Cj, where ej ^ A/'(0, cr^) are independent and 
identically distributed zero mean Gaussian random variables 
representing the GPS noise. 



We process the observations (tj , Xj)j=o,..., j to obtain stop 
and go trajectories of the probe vehicles: the trajectory of 
a vehicle alternates between phases of Stop during which 
the velocity of the vehicle is zero and Go during which the 
vehicle travels at positive speed. The number of Stop phases 
represents the state of the trajectory. We assume that the 
sampling frequency is high enough that the speed between 
successive observations {tj,Xj) and (tj+i, Xj+i) is constanl]^ 
and denoted Vj. Note that speeds are rarely provided by 
GPS devices or are too noisy to be valuable for estimation. 

Maximizing the log-likelihood of the observations is equiv- 
alent to solving the following optimization problem 



minimize 

(^i)je{o,...,j-i} 
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This is a typical least-square optimization problem, which 
we conveniently write in matrix form as: 



minimize- 1 1 At' — b\\2, 
V 2 



(1) 



with the notation = (i;j)jg{o,..., j-i} , 6= (xj — xo)je{i,..., j} 
and A is the lower triangular matrix whose entry on line 
i G {0, . . . , J — 1} and column /c G {0, . . . , i — 1} is given by 
tk+i — tk- 

To detect the Stop phases, we add a h regularization term, 
with parameter A to Problem ([T]). The resulting optimiza- 
tion problem is known as the LASSO [15] and reads 



minimize - 1 \Av 

V 2 



&ll2 + A||t;||i 



(2) 



The solution of ^ is typically sparse 15 , which means that 
there is a limited number of non-zero entries, corresponding 
to the Go phases of the trajectory. The solution is used to 
compute the number of Stops. Figure [3] illustrates the result 
of the trajectory estimation and Stop detection algorithm. 

Remark 1 (Data compression). In our datase^ the 
average number of GPS points sent by a vehicle on a link is 
9.6. The algorithm reduces the GPS trace to: entrance time 
in the link, travel time, number of stops. The amount of 
data to be processed by the subsequent algorithms is reduced 
by a factor of almost 10, which is crucial for large scale 
applications which process large amounts of historical data. 



Remark 2 (Fixing A), /n ([2|, A acts as a trade-off be- 
tween the sparsity of the solution and the fit to the observa- 
tions. Gross-validation is not appropriate in our setting to 
fix this parameter as it would require one to decimate the 
trajectory and use some observations for the learning and 
others for the validation. Instead, A is chosen by comput- 
ing the Bayesian Information Criterion (BIG), using ^191 to 
estimate the number of degrees of freedom. A LARS imple- 
mentation allows efficient computation to choose A and 
compute the corresponding solution. 



^The assumption is further justified by traffic modeling IT] 
which commonly assumes that each Go phase has constant 
speed 

^The number of GPS points per link depends on the level 
of congestion (vehicles spend a longer amount of time on 
each link), average length of the links of the network and 
sampling frequency of the probe vehicles. 




Figure 3: Example of trajectory estimation 



3. TRAVEL TIME MODEL 

We develop a travel time model which exploits the com- 
pressed data returned by the Stop-&-Go filter (number of 
stops and travel time experienced per link) . The model cap- 
tures the state transition probabilities: the probability of 
the number of stop on a link given the number of stops on 
the previous link of the trajectory. It also models the corre- 
lations of travel times for neighboring links given their state 
(number of stops). The travel time model is a combination 
of two models: 

• A directed Markov model of discrete state random vari- 
ables gives the joint probability distribution of the link states 
Pe{{Si}). 

• A Gaussian Markov random field gives the joint distribu- 
tion of the travel times P6»({ll,s}, / G £, s G {0, . . . , m — 1}). 




Figure 4: Graphical model of the dependency of link 
travel time on the state variable (e.g. number of 
stops) Si and the conditional travel times Y^ s. Here 
we consider the subgraph consisting of a link /, up- 
stream link u and downstream link d on a given path. 

Figure|4]presents the graphical model that encodes the de- 
pendencies between the link travel time , the link states Si 
and the conditional travel times Yi^s- The total travel time 
experienced by a vehicle on link / , is = ^T^o^ ^i,sl{S;=s}- 
The left portion of the figure shows a subset of the GMRF of 
travel time variables, and the right portion shows a subset 
of the Markov chain of states. 

The graphical model shows that conditioning on the states 
experienced along a path allows one to compute the path 
travel time by summing over the corresponding variables in 
the GMRF. Further, when one conditions on the link travel 
times Z\ then the two models become independent, which 
allows one to learn the models parameters separately. The 



rest of this section details the modeUng and learning of the 
two models. 

3.1 Markov model for state transitions 

We consider the state variables {Si}i^^c- Each variable Si 
has support {0, . . . , m — 1}. 

Given the path p = {lo, ... ,1m) of a vehicle, the variables 
{Si.}i^^i^___^M} have a Markov property, i.e. given the state 
of the upstream link h-i, the conditional state {Si-\Si-_-^^) is 
independent of the state of other upstream links: 



nsi.\{Su.^,---,Sio}) = i>{SiASu.,) 



(3) 



We parametrize the model using an initial probability vec- 
tor 

Tri = F{Si = s) (4) 

and a transition probability matrix 

Tl-:l=P{Si = si\Su = Su) (5) 

here T^^^^ is the probability that / is in state si given that 
the upstream link u is in state Su. 

We learn the initial probability vector tt^ and the tran- 
sition probability matrices T^^^ of the Markov Chain by 



maximizing the likelihood of observing (s 



(n) y{n) 



The log- 



likelihood is given by 

^ ( (-) 

\ u ^ I 

The parameters that maximize the log-likelihood are: 
and 



I -I 



Su ,Sl 



where Tf^ are the empirical counts of initial states and 
are the empirical transition counts. This solution corre- 
sponds to transitions and initial probabilities that are con- 
sistent with the empirical counts of initial states and tran- 
sitions. 

3.2 GMRF model for travel time distributions 

We now present the model that describes the correla- 
tions between the travel time variables. We assume that 
the random variables {yi,s)i^p ^^^^ are GaussiarQ and they 
can be stacked into one multivariate Gaussian variable Y ^ 
J\f (/i, E) of size m card(£), mean fi and covariance S. Recall 
that Yi^s is the travel time on link / conditioned on state s, 
where s G {0,...,m — 1}. This travel time is a Gaussian 
random variable with mean iii^s and variance ai^s- 

From the factorization property given by the Hammersley- 
Clifford Theorem, it is well known that the precision matrix 
S — Ti~^ encodes the conditional dependencies between the 
variables. Since a link is assumed to be conditionally cor- 
related only with its neighbors, the precision matrix is very 
sparse. Furthermore, this sparsity pattern is of particular 
interest: its pattern is that of a graph which is nearly pla- 
nar. We take advantage of this structure to devise efficient 
algorithms that (1) estimate the precision matrix given some 
observations, and (2), infer the covariance between any pair 
of variables. 



^While Gaussian travel times can theoretically predict nega- 
tive travel time, in practice, these probabilities are virtually 



zero, as validated in Section |5] 



As mentioned earlier, we have a set of N trajectories ob- 
tained from GPS data. After map-matching and trajectory 
reconstruction (Section [5]), the set of observed trajectories 
{Wp : p = 1, • • * 5 ^} are sequences of observed states and 
variables (travel time): 



(/l,Sl) (/2,S2) • • • {lMp,SMp) 



'{l,s)eWp 

In our notations, is an observation of the random vec- 
tor Yp = (^OiGWp' which is a Mp-dimensional marginal (or 
subset) of the full distribution Y. Hence Yp is also a mul- 



tivariate Gaussian with mean 



and covariance S 



obtained by dropping the irrelevant variables (the variables 
that one wants to marginalize out) from the mean vector 
/J and the covariance matrix S. Its likelihood is thus the 
likelihood under the marginal distribution: 



log p 
C - I log 



y^'-^^iWp), 



-{Wp) 



<Wp)) - 



- ^(Wp)) 



^(Wp) 



.y -^(Wp)) 



(6) 

where C is a constant which does not depend on the param- 
eters of the model. 

The problem of estimating the parameters of the model 
— (/J, Ti) from the i.i.d. set of observations V = {Wp,y^ : 
p = 1, • • • , N} consists in finding the set of parameters 0'^ 
that maximize the sum of the likelihoods of each of the ob- 
servations : 



l{0\V) = Y.\ogp{y^;fi^Wp),^(Wp) 



(7) 



Unfortunately, the problem of maximizing ([7| is in gen- 
eral not convex and may have multiple local minima since we 
have only partially observed variables y^ . A popular strat- 
egy in this case is to complete the vector (by computing the 
most likely completion given the observed variables). This 
algorithm is called the Expectation- Maximization (EM) pro- 
cedure. In our case, the EM procedure is not a good fit for 
two reasons: 

• Since we observe only a small fraction of the values of each 
vector, the vast majority of the values we would use for 
learning would be sampled values, which would make the 
convergence rate dramatically slow. 

• The data completion step would create a complete n— size 
sample for each of our observation, thus our complexity 
for the data completion step would be O (Nn), which is 
too large to be practical. 

Instead, we solve a related problem by computing sufficient 
statistics from all the observations. Consider the simpler 
scenario in which all data has been observed, and denote 
the empirical covariance matrix by S. The maximum like- 
lihood problem to find the most likely precision matrix is 
then equivalent to: 



minimize 

s 



log|5|+Tr(5s) 



(8) 



under the structured sparsity constraints Suv = V {u,v) ^ 
S. The objective is not defined when S is not positive defi- 
nite, so the constraint that S is positive definite is implicit. 
A key point to notice is that the objective only depends on 
a restricted subset of terms of the covariance matrix: 



Tr 



(Si) 



(u,v)e£ 



This observation motivates the fohowing approach: in- 
stead of considering the individual hkehhoods of each obser- 
vation individuahy, we consider the covariance that would 
be produced if all the observations were aggregated into a 
single covariance matrix. This approach discards some in- 
formation, for example the fact that some variables are more 
often seen than others. However, it lets us solve the full co- 
variance Problem Q using partial observations. Indeed, all 
we need to do is estimate the values of the coefficients Huv 
for (ix, v) a non-zero in the precision matrix. We present one 
way to estimate these coefficients. 

Let Ni be the number of observations of the variable Yi\ 
Ni = card ({p : i G Wp}). Combining all the observations 
that come across Yi, we can approximate the mean of any 
function / (Yi) by some empirical mean, using the Ni sam- 
ples: 



Ni 



(9) 



Similarly, defining the number of observations of both Yi 
and Yj-. Ninj = card {{p : i G Wp and j G Wp}), we can ap- 
proximate the mean of any function / (Yi, Yj) of Yi,Yj, using 
the set of observations that span both variables Yi and Yj : 



E,n,[/m,y,)] = ^ E /(yf-y?) (10) 

Using this notation, the empirical mean is fn = [Yi\. 
Call £ the partial empirical covariance matrix (PECM): 







otherwise 



Using this PECM as a proxy for the real covariance ma- 
trix, one can then estimate the most likely GMRF by solving 
the following problem: 



minimize 

s 



log|5| + (5,S 



(11) 



Note that this definition is asymptotically consistent: in 
the limit, when an infinite number of observations are gath- 
ered {Nij oo), the PECM will converge towards the true 
covariance: indeed S^j ^ E [YiYj] — E [y^] E [Yj] and for all 
S, (^S, ^ (^S, E [YY^] - E [y] E [Yf^ 

Unfortunately, the problem is not convex because E is not 
necessarily positive semi-definite (even if the limit is), since 
the variables are only partially observed. For instance, if 
we have a partially observed bivariate Gaussian variable X: 
(10,10), (-10,-10), (1,^), (-1,^), (★, 1), (^,-1), the em- 
pirical covariance matrix E has diagonal entries (51,51) and 
off-diagonal entries (100,100). Its eigenvalues are —49,151 
hence it is not definite positive. 

There is a number of ways to correct this. The simplest 
we found is to scale all the coefficients so that they have the 
same variance: 



^ _ j ai jE, 



.n,[l^lS-]-E.[y.]E,K-] ifiV.^ 

otherwise 



This transformation has the advantage of being local and 
easy to compute. This is why it is completed by an increase 
of the diagonal coefficients by some factor of the identity 
matrix. 

Another problem is due to the relative imbalance between 
the distributions of samples: cars travel much more on some 
roads than others. This means that some edges may be much 
better estimated than some others, but this confidence does 
not appear in the PECM. In practice, we found that pruning 
the edges with too few observations improved the results 

4. INFERENCE 

Given the model parameters, the inference task consists 
in computing the probability distribution P(Z*^^'* < t) of the 
total travel time along a fixed path (p) = (po, • • • ,Pi)- The 
path travel time (i.e. total travel time along the path) is 
given by 



zip) 



where S — {l,...,m}^ is the set of possible path states, 

(v) 

and is the path travel time given the path state s, and 

is given by Z^f = T^iep^i,si = e{p,s)^Y. Here, e(p, s) 
is a binary vector that selects the appropriate entries in the 
vector of travel times Y (corresponding to path p with states 
s): 

e(p, s)(i,s;) = 1 ^ I e p and s[ = si 

This vector e(p, s) is very sparse and has precisely / non-zero 
entries. Using the law of total probability we have: 



P (z^^^) = ^ P(5' = 5) P (zff ^) 



(12) 



The variable z[f ^ 



e{p,s)^Y is a linear combination of the 
multivariate variable Y, and so is also normally distributed: 

' A/" (^e(p, s)^/i, e(p, s)^Ee(p, s)j (13) 

The marginal distribution of travel times along a path is a 
mixture of univariate Gaussian distributions. There is how- 
ever two problems for this algorithm to be practical, (i) The 
mixture from Equation (12) contains a term for each pos- 
sible combination of states, and has size . Enumerating 
all the terms is impossible for moderately large lengths of 
paths, (a) In order to compute the variance of each distri- 
bution of the mixture, one needs to estimate the covariance 
matrix E. However storing (or computing) the complete co- 
variance matrix is prohibitively expensive with millions of 
variables. 

We find tractable solutions for (i) by using a sampling 
method on the Markov model to choose a tractable number 
of states, and (ii) by using a random projection method to 
construct a low-rank approximation of the covariance ma- 
trix E. Note that the mean of the complete distribution can 
be computed exactly in 0{lm^) time, using a dynamic pro- 
gramming algorithm. Thus, our model is also applicable to 
situations that do not require the full distribution. We do 
not present this algorithm further due to space limitations. 

Gibbs sampling. The sequence of state variables {Si}i for 
a given path form a Markov chain, with initial probability 



7r° and transition probability matrices which are 

given by the trained Markov model in Section \3l] We can 
sample from the Markov chain by first sampling sq from the 
categorical distribution with parameter 7r° , then sequentially 
sample Si from the conditional distribution of Si\Si-i. 

Using this procedure, we generate K samples s^, . . . , 
from <S, the set of possible states. The complete (exponen- 
tial) distribution can be approximated with the empirical 
distribution: 



(14) 



in which each weight P is the fraction of samples corre- 
sponding to the state s: 



ujp^s — -TT card{/c|s = s} 
K 



(15) 



The sum ( 14 ) contains at most K terms, since at most K 



approximate weights are positive, and converges towards 
the true distribution of states as the number of samples in- 
creases. In practice, numerical experiments showed that it 
was only necessary to sample a relatively small number of 
states. Section [5] includes a numerical analysis and evalua- 
tion of the sampling method. 

Low rank covariance approximation. Compute the vari- 
ances of each sequence of variables: 

e(s)^Se(s) 

with s being a valid sequence of variables in the graph of 
variables. 

Using the full covariance matrix S to estimate the covari- 
ance of each path cr (s)^ = e s)^ Ee (p, s) is impractical 
for two reasons: as mentioned before, we cannot expect to 
compute and access the full covariance matrix, and also the 
sum e (p, s)^ Se (p, s) sums elements from the covariance 
matrix. Since we do not need to know the variance terms 
with full precision, an approximation strategy using random 
projections is appropriate. More specifically, we use the fol- 
lowing result from ^: 

Given some fixed vectors vi ■ ■ -Vn G and e > 0, let 
^ ^ j^fcxd ^ random matrix with random Bernoulli entries 
=bl/v^ and with k > 24e~^logn. Then with probability at 
least 1 — 1/n: 

(l-e)||^;.f < \\Rv,f < (l + e)||^;.f 



Call R such a matrix. Consider the Cholesky decomposi- 
tion of the precision matrix, S = LL^ . Then 

cr(s)^ = e (p, s)^ Se (p, s) 

= \\L-\{p,s)\\' 
From the following lemma, we can approximate this norm: 



(s) = Q^e(p,s) = 



Si) 



(16) 



with Q = L~^R and R defined as obtained from the lemma 
above. Computing the approximate variance requires the 
addition of / vectors of size k. In practice, this summing 
operation is vectorized and very fast. 



This method assumes we can efficiently compute the Cholesky 
factorization, and that inversion operation L~^x is efficient 
as well. In our case, the graph of the GMRF is nearly pla- 
nar. Some very efficient algorithms exist that compute the 
Cholesky factorization in near-linear time ([2] ). In practice, 
computing the Q matrix is very fast. 

5. EVALUATION 

The article presents an algorithm to turn GPS traces from 
a small fraction of the vehicles traveling on the road network 
into valuable traffic information to develop large scale traf- 
fic information platforms and optimal and robust routing 
capabilities. The value of the model depends crucially on 
the quality of the estimates of point to point travel time dis- 
tribution. Another key feature of the algorithm is its com- 
putational complexity and its ability to scale on large road 
networks. The large number of road segments and intersec- 
tions leads to high-dimensional problems for any network 
of reasonable size. Moreover, the estimate for travel time 
distributions between any two points on the network needs 
to be computed in real-time. It would not be acceptable to 
wait more than a few seconds to get the results. This section 
first analyzes the quality of the learning and the estimation 
of the model. Then, we study the computational complexity 
of the algorithm and its ability to scale on large networks. 
In particular, some aspects of the algorithm arise as trade- 
offs between the computational complexity and the desired 
precision in the learning and real-time inference. 

The validation results are based on anonymized GPS traces 
provided by a GPS data aggregator. We consider an arterial 
network in the Bay Area of San Francisco, CA with 506,585 
links. The algorithm processes 426 million GPS points, ag- 
gregated from 2,640,319 individual trajectories. Each tra- 
jectory is less than 20 minutes long for privacy reasons. 

5.1 Travel time distribution 

The full validation of the performance of the algorithm re- 
quires the observation of the travel time of every vehicle on 
every link of the network. This mode of validation is unfor- 
tunately not available for any reasonably sized network. We 
validate the learning capabilities of the algorithm using data 
which was not used to train the Markov model GMRF. On 
this validation dataset, we perform two types of validation: 
(i) a path-level validation (a limited set of individual paths 
are evaluated) and (ii) a network-level validation (metrics 
taken over the entire validation dataset). 

Comparison models 

In this Section, the results of the model are compared to 
simpler models which arise as special cases of the model 
presented in the article. We introduce these models and de- 
nominations, which we use throughout the evaluation. 

• One mode independent: The travel time distribution on 
each segment is Gaussian. The travel time on distinct seg- 
ments are independent random variables. 

• One mode: The travel time distribution on the network 
is a multi-variate Gaussian (one dimension per link). In the 
precision matrix, element (i, j) may be non-zero if i and j 
map to neighboring links in the road network. 

• Multi-modal independent: Same as the MM-GMRF, ex- 
cepted that the covariance matrix of the multi-variate Gaus- 
sian is diagonal, imposing that given the mode, the travel 
times on different links of the network are independent. 
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Figure 6: Cumulative distribution of travel times 
computed by the different models and compared to 
the validation travel times received on the selected 
paths. 



Figure 5: 50% and 90% confidence intervals com- 
puted by the different models and compared to the 
validation travel times on the selected paths. 



The model developed in this article is referred to as the 
MM-GMRF model. 

Path validation 

Most probe vehicles have different paths throughout the 
road network. Among the trajectories of the vehicles in 
the validation set, we select a set of paths for which a large 
enough number of vehicles has traveled to perform statistical 
validation of the distribution of travel times. We impose a 
minimum length (set to 150 m in the numerical experiments) 
on these paths to ensure that we validate t he le arning of the 
spatial distributions of the modes (Section |3.1|) a nd the spa- 
tial correlations between each mode (Section |3.2| ). The paths 
are selected for having the largest amount of validation data. 

For each selected path, the box plots on Figure [5] com- 
pare the 50% and 90% confidence intervals of the valida- 
tion data collected on the path (top box) with the intervals 
computed by the different models {Multi-modal independent, 
MM-GMRF (our model). One mode and One mode inde- 
pendent, from top to bottom). We also display the median 
travel times as a vertical black line, both for the validation 
data and the different models. Scatter crosses, representing 
the validation travel times, are super-imposed to the results 
of each model to improve visualization. 

We notice a significant difference in the results between 
the uni- modal models {One mode and One mode indepen- 
dent) and the multi-modal models {Multi-modal indepen- 



dent, MM-GMRF). The uni-modal models tend to over- 
estimate both the median and the variance of travel times. 
These models cannot account for the difference of travel 
times due to stops on trajectories, which is one of the main 
features of arterial traffic |7|. The over-estimated variance 
illustrates why it is important to incorporate the variability 
of travel times due to stops in the structure of the model. 
On the other side, the multi- modal models are able to cap- 
ture the features of the distribution fairly accurately. The 
differences in accuracy between the Multi-modal independent 
model and the MM-GMRF model (which takes into account 
correlations in the Gaussian distribution) are not significant, 
even though the model with correlations estimates the vari- 
ability slightly more accurately. It seems that capturing the 
variability of travel times due to stop is the most important 
feature of the model. 

In Figure [6] we display the cumulative distribution of 
the validation data and the cumulative distribution of each 
model for the same paths as for Figure [s] (displayed in the 
same order). The figure displays more precisely the differ- 
ence in the estimation accuracy of the different models. As 
seen in Figure [5] the multi-modal models are more accurate 
than their uni-modal counterparts. 

Network scale validation 

Most points in the observation dataset represent different 
paths for the probe vehicles. For this reason, the distribu- 
tions cannot be compared directly. Instead, we compute the 
log-likelihood of each validation path and analyze the qual- 
ity of the travel time bounds provided by the distribution 
for each path. 



a) ?3.o 



U- ■ multi-modal indep 

▼ 'T MM-GMRF 

*"* one mode 

h-^ one mode indep 




a model, the corresponding metrics are computed as follows: 



200 300 400 

Length of the path 



■■ ■ multi-modal indep: a=0 .010, b=0 .011 

▼ 'T MM-GMRF: a=0.004,b=0.010 

*■ ■* one mode: a=0.041, b=0.033 

A— A one mode indep: a=0.039,b=0.015 




0.4 0.6 
Percentile 



Figure 7: a) Average log-likelihood of the validation 
paths (by path length), for each model, b) Valida- 
tion of the distribution percentiles for each model. 



Figure [t] a) displays the average likelihood of the valida- 
tion paths computed by the different models. The figure also 
analyzes how the path length influences the results. There 
are two motivations for doing so: (i) the length of the path 
influences the support of the distribution (longer paths are 
expected to have a larger support) which may affect the like- 
lihood and (ii) the different models may perform differently 
on different lengths as they do not take into account spatial 
dependencies in a similar way. 

As was expected from the analysis of Figures [5] and [6] 
the multi-modal models perform better than their uni-modal 
counterparts. Compared to the path validation results, the 
figure shows more significantly the effect of correlations. The 
figure shows slight improvements for the multi-modal model 
which takes into account the correlations. Surprisingly, the 
contrary is true for the uni-modal models. We also notice 
that the likelihood decreases with the length of the path, as 
we were expecting. 

Figure |7|b) analyzes the quality of the travel time distri- 
bution computed on the network. For that, we use a p-p 
plot (or percentile-percentile plot) which assesses how much 
each learned distribution matches the validation data. To 
each path p in the validation dataset corresponds an inverse 
cumulative distribution (computed from the trained 

model) and a travel time observation . A point {a, 13) 
on the curve corresponds to having /S percent of the valida- 
tion points such that < P^^(a). If the estimation was 
perfect, there would be exactly a% of the data points in the 
percentile a. To quantify how much each model deviates 
from perfect estimation, we display two metrics denoted a 
(above) and b (below) . Let / correspond to the p-p curve of 
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These values provide insight on the quality of the fit of 
the model. For example, a model with a large a-value tends 
to overestimate travel times. Similarly, a model with a large 
6- value tends to underestimate travel times. Both uni-modal 
models have large a- values. The large variance estimated by 
the models (already noticed in Figure [s]) to account for the 
variability of travel times leads to non-negligible probabil- 
ity densities for small travel times which are not physically 
possible. Compared to the likelihood validation of Figure [7| 
a), the p-p plot analyzes the quality of fit for different per- 
centiles of the distribution. In particular, we notice that 
the effect of capturing the correlations in the multi-variate 
model mostly affects the estimation of the low and high per- 
centiles in the distribution. We expect that this is due to 
the fact that correlations accounts for the impacts of slow 
vs. fast drivers or congested vs. lest congested conditions. 

5.2 Sampling 

In this section, we discuss numerical results regarding the 
quality of approximate inference using the Gibbs sampling 
method, on a fixed path on the network. 
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Figure 8: KL-divergence between the approximate 
distribution and the exact mixture distribution, as 
a function of the path length (left), and example 
distributions for a path of length 1—17 (right). 

Figure [S] shows the Kullback-Leibler divergence of the ap- 
proximate distributions, with respect to the exact distribu- 
tion. We compare two runs of the sampling, with respective 
sizes G = 1001og(/) and G = 1000 log(/) where / is the 
length of the path in links. The divergence measures the 
similarity between two distributions. As can be seen, even 
a small number of samples (relative to the total size of the 
mixture) leads to a very close approximation. 

5.3 Scaling 

In this section, we discuss the scalability aspects of the 
learning algorithm (it is clear from the discussion in Sec- 
tion |3.2| that the inference is independent from the size of the 
network). We ran the learning for networks defined by dif- 
ferent bounding boxes. The bounding boxes were adjusted 
so that the number of links in each subnetworks had different 
orders of magnitude. The longest step by far is the training 
of the GMRF. We report in Figure [9] the training time for 
different networks, all other parameters being equal. As one 
can see, the training time increases linearly with the number 
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Figure 9: Log-log plot of the training time, as a 
function of the size of the GMRF. 

of variables of the GMRF over a large range of network sizes. 
The graphs associated to each GMRF are extremely sparse: 
the average vertex degree of the largest graph is 9.46. 

6. CONCLUSIONS 

The state of the art for travel time estimation has fo- 
cused on either precise and computationally intensive physi- 
cal models, or large scale, data-driven approaches. We have 
presented a novel algorithm for travel time estimation that 
aims at combining the best of both worlds, by combining 
physical insights with some scalable algorithms. We model 
the variability of travel times due to stops at intersections 
using a Stop-&-Go filter (to detect stops) and a Markov 
model to learn the spatial dependencies between stop loca- 
tions. We also take into account the spatio-temporal corre- 
lations of travel times due to driving behavior or congestion, 
using a Gaussian Markov Random Fields. In particular, we 
present a highly scalable algorithm to train and perform in- 
ference on Gaussian Markov Random Fields, when applied 
on geographs. 

We analyze the accuracy of the model using probe vehicle 
data collected over the Bay Area of San Francisco, CA. The 
results underline the importance to take into account the 
multi-modality of travel times in arterial networks due to 
the presence of traffic signals. The quality of the results 
we obtain are competitive with the state of the state of the 
art in traffic, and also highlight the good scalability of our 
algorithm. 
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